ABSTRACT



A microprocessor and method of processing instructions 
therein are disclosed. Initially, a sequence of instructions is 
dispatched by a dispatch unit of the microprocessor. A code 
sequence recognition unit (CSR) is configured to detect a 
short branch sequence within the sequence of instructions, 
where the short branch sequence includes a condition setting 
instruction, a conditional branch, and at least one additional 
instruction that is executed if the conditional branch is not 
taken. The short branch sequence is then internally con- 
verted to a predicated instruction sequence that includes the 
condition setting instruction and a predicated instruction 
corresponding to each additional instruction in the short 
branch sequence. The predicated instruction sequence is 
then executed in at least one functional unit of the processor. 
Detecting the short branch sequence may include calculating 
the relative branch address associated with the conditional 
branch instruction and comparing the relative branch 
address to a specified maximum. In one embodiment, the 
received sequence of instructions may be converted into an 
instruction group by the processor. In this embodiment, the 
specified maximum number of instructions in a short branch 
sequence may be a function of the number of instructions in 
an instruction group. In an embodiment where the condi- 
tional branch statement is preferably allocated to the last slot 
of the instruction group, the additional instructions in the 
short branch sequence are located in the next subsequent 
instruction group. Converting the short branch sequence to 
the predicated instruction sequence may include converting 
each additional instruction in the short branch sequence to 
an analogous predicated instruction. In one embodiment, 
converting each additional instruction to its analogous predi- 
cated instruction includes determining a predicated instruc- 
tion opcode for each additional instruction in the short 
branch sequence by adjusting the opcode of each additional 
instruction by a predetermined offset. In another 
embodiment, the opcode conversion may be accomplished 
with an opcode lookup table. 

24 Claims, 6 Drawing Sheets 
CONVERTING SHORT BRANCHES TO 
PREDICATED INSTRUCTIONS 

BACKGROUND 

1. Field of the Present Invention 

The present invention generally relates to the field of 
microprocessor architectures and more particularly to a 
microprocessor utilizing an instruction group architecture 
and logic for detecting code sequences within an instruction 
group that are suitable for conversion to one or more 
predicated instructions. 

2. History of Related Art 

As microprocessor technology has enabled gigahertz 
performance, a major challenge for microprocessor design- 
ers is to take advantage of state-of-the-art technologies while 
maintaining compatibility with the enormous base of 
installed software designed for operation with a particular 
instruction set architecture (ISA). To address this problem, 
designers have implemented "layered architecture" micro- 
processors that are adapted to receive instructions formatted 
according to an existing ISA and to convert the instruction 
format of the received instructions to an internal ISA that is 
more suitable for operation in gigahertz execution pipelines. 

Because a layered architecture adds to the processor 
pipeline and increases the number of instructions that are 
potentially "in flight" at a given time, the branch mispredict 
penalty associated with a layered architecture is of great 
concern. One approach to minimizing branch misprediction 
penalties attempts simply to reduce the number of branch 
instructions. Since branch misprediction can only occur on 
a branch instruction, a code sequence containing no branch 
instructions can never be mispredicted. A well-known 
method for reducing the number of branch instructions in a 
code sequence is the use of predicated instructions. 
Predicated instructions are instructions that perform a 
function, such as a fixed point add, if a condition that is 
specified in the predicated instruction itself is satisfied. If 
the condition is not satisfied, the instruction is treated as a NOP. 

Predicated instructions can beneficially replace a code 
sequence that includes a condition setting instruction (such 
as a compare) followed by a conditional branch instruction 
and a short code sequence that is executed depending upon 
the status of the condition. In such a sequence, the condi- 
tional branch is used to branch around the relatively short 
code sequence depending upon the state of the condition. In 
the predicated instruction implementation of such a code 
sequence, the conditional branch statement is eliminated and 
each of the instructions in the short code sequence is 
replaced with a predicated instruction. As an example, the 
code sequence: 

COMP R1, 0          //condition setting instruction 
BEQ LBL1            //branch to LBL1 if R1=0 
ADD R2, R3, R4 
ADD R2, R2, R5 
LBL1: NOP 

could be replaced with predicated instructions as follows: 

COMP R1, 0          //condition setting instruction 
PADD R2, R3, R4, NE //predicated add executed only if condition (NE) is true 
PADD R2, R2, R5, NE //predicated add executed only if condition (NE) is true 

Typically, predicated instructions are generated from high 
level source code by a compiler designed for use with an 
instruction set and hardware that support predicated 
instructions. The predicated instructions may have a distinct 
opcode from their non-predicated analogs. When compiling 
code for an instruction set that does not include predicated 
instructions, however, the compiler is forced to produce 
executable code that includes the conditional branch 
statement. It would be highly desirable to implement 
processor hardware capable of recognizing a code sequence 
characterized by a short branch and further capable of 
converting the sequence to a predicated code sequence 
during instruction decode or dispatch and executing the 
predicated code sequence. It would be further desirable if 
this predicated instruction conversion were transparent to 
the system user such that recompiling of existing code would 
not be required to take advantage of the predicated execution 
hardware. 
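
The equivalence of the branch sequence and the predicated sequence shown in the example above can be checked with a minimal Python sketch. The function names (run_branch_sequence, padd, run_predicated_sequence) are illustrative assumptions and are not part of the disclosure; register values are modeled as plain integers and the NE condition as a boolean.

```python
def run_branch_sequence(r1, r2, r3, r4, r5):
    """Model of COMP / BEQ / ADD / ADD with an explicit branch around the adds."""
    equal = (r1 == 0)          # COMP R1, 0
    if not equal:              # BEQ LBL1 skips both adds when R1 == 0
        r2 = r3 + r4           # ADD R2, R3, R4
        r2 = r2 + r5           # ADD R2, R2, R5
    return r2                  # LBL1


def padd(dest_old, a, b, predicate):
    """Hypothetical predicated add: a + b if the predicate holds, else a NOP."""
    return a + b if predicate else dest_old


def run_predicated_sequence(r1, r2, r3, r4, r5):
    """Model of COMP / PADD / PADD with no branch instruction."""
    not_equal = (r1 != 0)                 # COMP R1, 0 (NE condition)
    r2 = padd(r2, r3, r4, not_equal)      # PADD R2, R3, R4, NE
    r2 = padd(r2, r2, r5, not_equal)      # PADD R2, R2, R5, NE
    return r2
```

Both functions return the same value of R2 whether or not R1 is zero, which is the property that allows the branch to be removed.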

SUMMARY OF THE INVENTION 

The goals described above are achieved with a microprocessor 
and method of processing instructions therein as 
disclosed herein. Initially, a sequence of instructions is 
dispatched by a dispatch unit of the microprocessor. A code 
sequence recognition unit (CSR) is configured to detect a 
short branch sequence within the sequence of instructions, 
where the short branch sequence includes a condition setting 
instruction, a conditional branch, and at least one additional 
instruction that is executed if the conditional branch is not 
taken. The short branch sequence is then internally 
converted to a predicated instruction sequence that includes the 
condition setting instruction and a predicated instruction 
corresponding to each additional instruction in the short 
branch sequence. The predicated instruction sequence is 
then executed in at least one functional unit of the processor. 
Detecting the short branch sequence may include calculating 
the relative branch address associated with the conditional 
branch instruction and comparing the relative branch 
address to a specified maximum. In one embodiment, the 
received sequence of instructions may be converted into an 
instruction group by the processor. In this embodiment, the 
specified maximum number of instructions in a short branch 
sequence may be a function of the number of instructions in 
an instruction group. In an embodiment where the 
conditional branch statement is preferably allocated to the last slot 
of the instruction group, the additional instructions in the 
short branch sequence are located in the next subsequent 
instruction group. Converting the short branch sequence to 
the predicated instruction sequence may include converting 
each additional instruction in the short branch sequence to 
an analogous predicated instruction. In one embodiment, 
converting each additional instruction to its analogous 
predicated instruction includes determining a predicated 
instruction opcode for each additional instruction in the short 
branch sequence by adjusting the opcode of each additional 
instruction by a predetermined offset. In another 
embodiment, the opcode conversion may be accomplished 
with an opcode lookup table. 
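
The detection and conversion steps summarized above can be sketched in Python under stated assumptions. The Instr record, the mnemonics COMP and BC, the limit MAX_SKIPPED, the table OPCODE_TABLE, and the value PREDICATE_OFFSET are all hypothetical and chosen for illustration; the disclosure does not specify encodings. The relative branch address is modeled as a forward distance counted in instructions, so a branch with distance k skips k - 1 instructions.

```python
from dataclasses import dataclass
from typing import List, Tuple

MAX_SKIPPED = 4            # assumed maximum number of skipped instructions
PREDICATE_OFFSET = 0x40    # assumed fixed offset between an opcode and its predicated form
INVERT = {"EQ": "NE", "NE": "EQ", "LT": "GE", "GE": "LT"}
OPCODE_TABLE = {"ADD": "PADD", "SUB": "PSUB"}   # lookup-table alternative


@dataclass(frozen=True)
class Instr:
    op: str
    args: Tuple = ()
    rel_target: int = 0    # branches only: forward distance in instructions
    cond: str = ""         # condition tested by a branch or a predicated op


def is_short_branch(seq: List[Instr], i: int) -> bool:
    """True if seq[i] sets a condition and seq[i + 1] branches over a short body."""
    if i + 1 >= len(seq):
        return False
    setter, branch = seq[i], seq[i + 1]
    if setter.op != "COMP" or branch.op != "BC" or branch.cond not in INVERT:
        return False
    skipped = branch.rel_target - 1          # compare against the specified maximum
    if not 1 <= skipped <= MAX_SKIPPED:
        return False
    body = seq[i + 2:i + 2 + skipped]
    return len(body) == skipped and all(ins.op in OPCODE_TABLE for ins in body)


def convert_short_branches(seq: List[Instr]) -> List[Instr]:
    """Drop the branch and predicate each skipped instruction on the inverse condition."""
    out, i = [], 0
    while i < len(seq):
        if is_short_branch(seq, i):
            branch = seq[i + 1]
            skipped = branch.rel_target - 1
            out.append(seq[i])               # keep the condition setting instruction
            for ins in seq[i + 2:i + 2 + skipped]:
                out.append(Instr(OPCODE_TABLE[ins.op], ins.args, cond=INVERT[branch.cond]))
            i += 2 + skipped
        else:
            out.append(seq[i])
            i += 1
    return out


def predicated_opcode_by_offset(opcode: int) -> int:
    """Offset-based alternative to the lookup table, for numeric opcodes."""
    return opcode + PREDICATE_OFFSET
```

Applied to the COMP / branch-on-EQ / ADD / ADD sequence from the background section, the sketch yields COMP followed by two PADD operations predicated on NE. The lookup table handles instruction sets whose predicated opcodes are not at a uniform distance from their originals, while the offset variant needs only an adder.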

BRIEF DESCRIPTION OF THE DRAWINGS 

Other objects and advantages of the invention will 
become apparent upon reading the following detailed 
description and upon reference to the accompanying 
drawings in which: 

FIG. 1 is a block diagram of selected components of a data 
processing system including a microprocessor according to 
one embodiment of the present invention; 

FIG. 2 is a block diagram of selected components of a 
microprocessor according to one embodiment of the present 
invention; 

FIG. 3 illustrates examples of the instruction cracking 
function performed by one embodiment of the processor of 
FIG. 2; 

FIG. 4 is a block diagram illustrating selected portions of 
a microprocessor according to one embodiment of the 
invention; 

FIG. 5 is a block diagram of a basic block cache of the 
microprocessor of FIG. 2; 

FIG. 6 is an illustration of various branching scenarios 
that the processor of FIG. 2 may encounter; and 

FIG. 7 is a block diagram of a completion table suitable 
for use with the present invention. 

While the invention is susceptible to various modifications 
and alternative forms, specific embodiments thereof 
are shown by way of example in the drawings and will 
herein be described in detail. It should be understood, 
however, that the drawings and detailed description 
presented herein are not intended to limit the invention to the 
particular embodiment disclosed, but on the contrary, the 
intention is to cover all modifications, equivalents, and 
alternatives falling within the spirit and scope of the present 
invention as defined by the appended claims. 

DETAILED DESCRIPTION OF A PREFERRED 
EMBODIMENT OF THE PRESENT INVENTION 

Referring now to FIG. 1, an embodiment of a data 
processing system 100 according to the present invention is 
depicted. System 100 includes one or more central processing 
units (processors) 101a, 101b, 101c, etc. (collectively or 
generically referred to as processor(s) 101). In one 
embodiment, each processor 101 may comprise a reduced 
instruction set computer (RISC) microprocessor. Additional 
information concerning RISC processors in general is available 
in C. May et al. Ed., PowerPC Architecture: A Specification 
for a New Family of RISC Processors, (Morgan 
Kaufmann, 1994 2d edition). Processors 101 are coupled to 
system memory 250 and various other components via 
system bus 113. Read only memory (ROM) 102 is coupled 
to the system bus 113 and may include a basic input/output 
system (BIOS), which controls certain basic functions of 
system 100. FIG. 1 further depicts an I/O adapter 107 and a 
network adapter 106 coupled to the system bus 113. I/O 
adapter 107 links system bus 113 with mass storage devices 
104 such as a hard disk 103 and/or a tape storage drive 105. 
Network adapter 106 interconnects bus 113 with an external 
network enabling data processing system 100 to communicate 
with other such systems. Display monitor 136 is connected 
to system bus 113 by display adapter 112, which may 
include a graphics adapter to improve the performance of 
graphics intensive applications and a video controller. In one 
embodiment, adapters 107, 106, and 112 may be connected 
to one or more I/O busses that are connected to system bus 
113 via an intermediate bus bridge (not shown). Suitable I/O 
busses for connecting peripheral devices such as hard disk 
controllers, network adapters, and graphics adapters include 
the Peripheral Components Interface (PCI) bus as specified 
according to PCI Local Bus Specification Rev. 2.2 available 
from the PCI Special Interest Group, Hillsboro, Oreg., and 
incorporated by reference herein. Additional input/output 
devices are shown as connected to system bus 113 via user 
interface adapter 108. A keyboard 109, mouse 110, and 
speaker 111 are all linked to bus 113 via user interface 
adapter 108, which may include, for example, a SuperI/O 
chip integrating multiple device adapters into a single 
integrated circuit. For additional information concerning one 
such chip, the reader is referred to the PC87338/PC97338 
ACPI 1.0 and PC98/99 Compliant SuperI/O data 
sheet from National Semiconductor Corporation (November 
1998) at www.national.com. Thus, as configured in FIG. 1, 
system 100 includes processing means in the form of 
processors 101, storage means including system memory 
250 and mass storage 104, input means such as keyboard 
109 and mouse 110, and output means including speaker 111 
and display 136. In one embodiment a portion of system 
memory 250 and mass storage 104 collectively store an 
operating system such as the AIX® operating system from 
IBM Corporation or other suitable operating system to 
coordinate the functions of the various components shown in 
FIG. 1. Additional detail concerning the AIX operating 
system is available in AIX Version 4.3 Technical Reference: 
Base Operating System and Extensions, Volumes 1 and 2 
(order numbers SC23-4159 and SC23-4160); AIX Version 
4.3 System User's Guide: Communications and Networks 
(order number SC23-4122); and AIX Version 4.3 System 
User's Guide: Operating System and Devices (order number 
SC23-4121) from IBM Corporation at www.ibm.com and 
incorporated by reference herein. 

Turning now to FIG. 2, a simplified block diagram of a 
processor 101 according to one embodiment of the present 
invention is illustrated. Processor 101 as depicted in FIG. 2 
includes an instruction fetch unit 202 suitable for generating 
an address of the next instruction to be fetched. The instruction 
address generated by fetch unit 202 is provided to an 
instruction cache 210. Fetch unit 202 may include branch 
prediction logic that, as its name suggests, is adapted to 
make an informed prediction of the outcome of a decision 
that affects the program execution flow. The ability to 
correctly predict branch decisions is a significant factor in 
the overall ability of processor 101 to achieve improved 
performance by executing instructions speculatively and 
out-of-order. The instruction address generated by fetch unit 
202 is provided to an instruction cache 210, which contains 
a subset of the contents of system memory in a high speed 
storage facility. The instructions stored in instruction cache 
210 are preferably formatted according to a first ISA, which 
is typically a legacy ISA such as, for example, the PowerPC 
or an x86 compatible instruction set. Detailed information 
regarding the PowerPC® instruction set is available in the 
PowerPC 620 RISC Microprocessor User's Manual available 
from Motorola, Inc. (Order No. MPC620UM/AD), 
which is incorporated by reference herein. If the instruction 
address generated by fetch unit 202 corresponds to a 
system memory location that is currently replicated in 
instruction cache 210, instruction cache 210 forwards the 
corresponding instruction to cracking unit 212. If the 
instruction corresponding to the instruction address generated 
by fetch unit 202 does not currently reside in instruction 
cache 210 (i.e., the instruction address provided by fetch unit 
202 misses in instruction cache 210), the instruction must 
be fetched from an L2 cache (not shown) or system memory 
before the instruction can be forwarded to cracking unit 212. 

Cracking unit 212 is adapted to modify an incoming 
instruction stream to produce a set of instructions optimized 
for executing in an underlying execution pipeline at high 
operating frequencies (i.e., operating frequencies exceeding 
1 GHz). In one embodiment, for example, cracking unit 212 
receives instructions in a 32-bit wide ISA such as the 
instruction set supported by the PowerPC® microprocessor 
and converts the instructions to a second, preferably wider, 
ISA that facilitates execution in a high speed execution unit 
operating in the gigahertz frequency range and beyond. The 
wider format of the instructions generated by cracking unit 
212 may include, as an example, explicit fields that contain 
information (such as operand values) that is merely implied 
or referenced in the instructions received by cracking unit 
212, which are formatted according to a first format. In one 
embodiment, for example, the ISA of instructions generated 
by cracking unit 212 is 64 or more bits wide. 

In one embodiment, cracking unit 212 as contemplated 
herein, in addition to converting instructions from a first 
format to a second, and preferably wider, format, is designed 
to organize a set of fetched instructions into instruction 
"groups" 302, examples of which are depicted in FIG. 3. 
Each instruction group 302 includes a set of instruction slots 
304a, 304b, etc. (collectively or generically referred to as 
instruction slots 304). The organization of a set of instructions 
into instruction groups facilitates high speed execution 
by, among other things, simplifying the logic needed to 
maintain rename register mapping and completion tables for 
a large number of in-flight instructions. 

In FIG. 3, three examples of instruction grouping that may 
be performed by cracking unit 212 are depicted. In Example 
1, a set of instructions indicated by reference numeral 301 is 
transformed into a single instruction group 302 by cracking 
unit 212. In the depicted embodiment of the invention, each 
instruction group 302 includes five slots indicated by reference 
numerals 304a, 304b, 304c, 304d, and 304e. Each slot 
304 may contain a single instruction. In this embodiment, 
each instruction group may include a maximum of five 
instructions. In one embodiment, the instructions in the set 
of instructions 301 received by cracking unit 212 are formatted 
according to a first ISA, as discussed previously, and 
the instructions stored in the groups 302 are formatted 
according to a second wider format. The use of instruction 
groups simplifies renaming recovery and completion table 
logic by reducing the number of instructions that must be 
individually tagged and tracked. The use of instruction 
groups thus contemplates sacrificing some information 
about each instruction in an effort to simplify the process of 
tracking pending instructions in an out-of-order processor. 

Example 2 of FIG. 3 illustrates a second example of the 
instruction grouping performed by cracking unit 212 according 
to one embodiment of the invention. This example 
demonstrates the capability of cracking unit 212 to break 
down complex instructions into a group of simple instructions 
for higher speed execution. In the depicted example, a 
sequence of two load-with-update (LDU) instructions is 
broken down into an instruction group including a pair of 
load instructions in slots 304a and 304c respectively and a 
pair of ADD instructions in slots 304b and 304d respectively. 
In this example, because group 302 does not contain 
a branch instruction, the last slot 304e of instruction group 
302 contains no instruction. The PowerPC® load-with-update 
instruction, like analogous instructions in other 
instruction sets, is a complex instruction in that the instruction 
affects the contents of multiple general purpose registers 
(GPRs). Specifically, the load-with-update instruction can 
be broken down into a load instruction that affects the 
contents of a first GPR and an ADD instruction that affects 
the contents of a second GPR. Thus, in instruction group 302 
of Example 2 in FIG. 3, instructions in two or more 
instruction slots 304 correspond to a single instruction 
received by cracking unit 212. 

In Example 3, a single instruction entering cracking unit 
212 is broken down into a set of instructions occupying 
multiple groups 302. More specifically, Example 3 illustrates 
a load multiple (LM) instruction. The load multiple 
instruction (according to the PowerPC® instruction set) 
loads the contents of consecutive locations in memory into 
consecutively numbered GPRs. In the depicted example, a 
load multiple of six consecutive memory locations breaks 
down into six load instructions. Because each group 302 
according to the depicted embodiment of processor 101 
includes, at most, five instructions, and because the fifth slot 
304e is reserved for branch instructions, a load multiple of 
six registers breaks down into two groups 302a and 302b 
respectively. Four of the load instructions are stored in the 
first group 302a while the remaining two load instructions 
are stored in the second group 302b. Thus, in Example 3, a 
single instruction is broken down into a set of instructions 
that span multiple instruction groups 302. 

Returning now to FIG. 2, the instruction groups 302 
generated by the preferred embodiment of cracking unit 212 
are forwarded to a basic block cache 213 where they are 
stored pending execution. Referring to FIG. 5, an embodiment 
of basic block cache 213 is depicted. In the depicted 
embodiment, basic block cache 213 includes a set of entries 
502a through 502n (generically or collectively referred to as 
basic block cache entries 502). In one embodiment, each 
entry 502 in basic block cache 213 contains a single instruction 
group 302. In addition, each entry 502 may include an 
entry identifier 504, a pointer 506, and an instruction address 
(IA) field 507. The instruction address field 507 for each 
entry 502 is analogous to the IA field 704 of completion 
table 218. In one embodiment, each entry 502 in basic block 
cache 213 corresponds to an entry in completion table 218 
and the instruction address field 507 indicates the instruction 
address of the first instruction in the corresponding instruction 
group 302. In one embodiment, the pointer 506 indicates 
the entry identifier of the next instruction group 302 to 
be executed based upon a branch prediction algorithm, 
branch history table, or other suitable branch prediction 
mechanism. 

As indicated previously, the preferred implementation of 
forming instruction groups 302 with cracking unit 212 
allocates branch instructions to the last slot 304 in each 
group 302. In addition, the preferred embodiment of cracking 
unit 212 produces instruction groups 302 in which the 
number of branch instructions in a group 302 is limited to 
one or fewer. In this arrangement, each instruction group 302 
can be thought of as representing a "leg" of a branch tree 600 as 
indicated in FIG. 6, in which instruction groups 302 are 
represented by their corresponding instruction group entry 
504 values. First instruction group 302a, for example, is 
indicated by its entry number (1), and so forth. Suppose, as 
an example, that the branch prediction mechanism of processor 
101 predicts that leg 2 (corresponding to second 
group 302b) will be executed following leg 1 and that leg 3 
will be executed following leg 2. The basic block cache 213, 
according to one embodiment of the invention, reflects these 
branch predictions by setting the pointer 506 to indicate the 
next group 302 to be executed. The pointer 506 of each entry 
502 in basic block cache 213 may be utilized to determine 
the next instruction group 302 to be dispatched. 

Basic block cache 213 works in conjunction with a block 
fetch unit 215 analogous to the manner in which fetch unit 
202 works with instruction cache 210. More specifically, 
block fetch unit 215 is responsible for generating an instruction 
address that is provided to basic block cache 213. The 
instruction address provided by block fetch unit 215 is 
compared against addresses in the instruction address fields 
507 in basic block cache 213. If the instruction address 
provided by block fetch unit 215 hits in basic block cache 
213, the appropriate instruction group is forwarded to issue 
queues 220. If the address provided by block fetch unit 215 
misses in basic block cache 213, the instruction address is 
fed back to fetch unit 202 to retrieve the appropriate instructions 
from instruction cache 210. In one embodiment suitable 
for its conservation of area (die size), basic block cache 
213 enables the elimination of instruction cache 210. In this 
embodiment, instructions are retrieved from a suitable storage 
facility such as an L2 cache or system memory and 
provided directly to cracking unit 212. If an instruction 
address generated by block fetch unit 215 misses in basic 
block cache 213, the appropriate instructions are retrieved 
from an L2 cache or system memory rather than from 
instruction cache 210. 
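
The grouping rules described above for cracking unit 212 (five slots per group, the fifth slot reserved for a branch, at most one branch per group, and complex instructions cracked into simple ones) can be sketched in Python. The tuple encoding of instructions and the names crack and form_groups are assumptions made for illustration only; operand handling is reduced to the minimum needed to show slot allocation.

```python
GROUP_SIZE = 5   # slots 304a through 304e; the last slot is reserved for a branch


def crack(instr):
    """Break a complex instruction into simple ones (hypothetical encoding)."""
    op, *args = instr
    if op == "LDU":                       # load-with-update: one load plus one add
        target, base = args
        return [("LOAD", target, base), ("ADD", base)]
    if op == "LM":                        # load multiple of n consecutive locations
        (count,) = args
        return [("LOAD", k) for k in range(count)]
    return [instr]


def form_groups(instrs):
    """Pack cracked instructions into groups of GROUP_SIZE slots."""
    groups, current = [], []

    def flush(branch=None):
        nonlocal current
        padding = [None] * (GROUP_SIZE - 1 - len(current))
        groups.append(current + padding + [branch])   # branch always lands in the last slot
        current = []

    for instr in instrs:
        for simple in crack(instr):
            if simple[0] == "BRANCH":
                flush(simple)             # a branch closes the current group
            else:
                if len(current) == GROUP_SIZE - 1:
                    flush()               # non-branch slots are full
                current.append(simple)
    if current:
        flush()
    return groups
```

With this sketch, a load multiple of six yields one group holding four loads and a second group holding two, and a pair of load-with-update instructions yields a single group with LOAD, ADD, LOAD, ADD in the first four slots and an empty fifth slot, matching Examples 3 and 2 respectively.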

The depicted embodiment of processor 101 further includes 
a dispatch unit 214. Dispatch unit 214 is responsible 
for ensuring that all necessary resources are available prior 
to forwarding the instructions in each instruction group to 
their appropriate issue queues 220. In addition, dispatch unit 
214 communicates with dispatch and completion control 
logic 216 to keep track of the order in which instructions 
were issued and the completion status of these instructions 
to facilitate out-of-order execution. In the embodiment of 
processor 101 in which cracking unit 212 organizes incoming 
instructions into instruction groups as discussed above, 
each instruction group 302 is assigned a group tag (GTAG) 
by dispatch and completion control logic 216 that conveys the 
ordering of the issued instruction groups. As an example, 
dispatch unit 214 may assign monotonically increasing 
values to consecutive instruction groups. With this 
arrangement, instruction groups with lower GTAG values 
are known to have issued prior to (i.e., are older than) 
instruction groups with larger GTAG values. Although the 
depicted embodiment of processor 101 indicates dispatch 
unit 214 as a distinct functional block, the group instruction 
organization of basic block cache 213 lends itself to 
incorporating the functionality of dispatch unit 214. Thus, in one 
embodiment, dispatch unit 214 is incorporated within basic 
block cache 213, which is connected directly to issue queues 
220. 

In association with dispatch and completion control logic 
216, a completion table 218 is utilized in one embodiment 
of the present invention to track the status of issued 
instruction groups. Turning to FIG. 7, a block diagram of one 
embodiment of completion table 218 is presented. In the 
depicted embodiment, completion table 218 includes a set of 
entries 702a through 702n (collectively or generically 
referred to herein as completion table entries 702). In this 
embodiment, each entry 702 in completion table 218 
includes an instruction address (IA) field 704 and a status bit 
field 706. In this embodiment, the GTAG value of each 
instruction group 302 identifies the entry 702 in completion 
table 218 in which completion information corresponding to 
the instruction group 302 is stored. Thus, the instruction 
group 302 stored in entry 1 of completion table 218 will have 
a GTAG value of 1, and so forth. In this embodiment, 
completion table 218 may further include a "wrap around" 
bit to indicate that an instruction group with a lower GTAG 
value is actually younger than an instruction group with a 
higher GTAG value. In one embodiment, the instruction 
address field 704 includes the address of the instruction in 
first slot 304a of the corresponding instruction group 302. 
Status field 706 may contain one or more status bits 
indicative of whether, for example, the corresponding entry 702 in 
completion table 218 is available or if the entry has been 
allocated to a pending instruction group. 
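
One way to combine monotonically increasing GTAG values with a "wrap around" bit is sketched below in Python. This is an assumption about how such a bit could be used, not the disclosed circuit: the sketch carries one wrap bit with each tag, toggles it each time allocation returns to the first table entry, numbers entries from 0, and assumes that no more groups are in flight than the table has entries. The names GtagAllocator and is_older are hypothetical.

```python
class GtagAllocator:
    """Hands out (gtag, wrap_bit) pairs for a completion table of fixed size."""

    def __init__(self, table_size):
        self.table_size = table_size
        self.next_tag = 0
        self.wrap = 0

    def allocate(self):
        tag = (self.next_tag, self.wrap)
        self.next_tag += 1
        if self.next_tag == self.table_size:   # ran off the end of the table
            self.next_tag = 0
            self.wrap ^= 1                     # toggle the wrap-around bit
        return tag


def is_older(a, b):
    """True if group a was dispatched before group b.

    With equal wrap bits the lower GTAG is older. With different wrap bits
    the numeric order is reversed: the lower GTAG is the younger group.
    """
    (tag_a, wrap_a), (tag_b, wrap_b) = a, b
    if wrap_a == wrap_b:
        return tag_a < tag_b
    return tag_a > tag_b
```

For a four-entry table, the fifth allocation reuses tag 0 with the wrap bit set, and is_older correctly reports that the group with tag 3 precedes it.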

In the embodiment of processor 101 depicted in FIG. 2, 
instructions are issued from dispatch unit 214 to issue 
queues 220 where they await execution in corresponding 
execution pipes 222. Processor 101 may include a variety of 
types of execution pipes, each designed to execute a subset 
of the processor's instruction set. In one embodiment, 
execution pipes 222 may include a branch unit pipeline 224, 
a load store pipeline 226, a fixed point arithmetic unit 228, 
and a floating point unit 230. Each execution pipe 222 may 
comprise two or more pipeline stages. Instructions stored in 
issue queues 220 may be issued to execution pipes 222 using 
any of a variety of issue priority algorithms. In one 
embodiment, for example, the oldest pending instruction in 
an issue queue 220 is the next instruction issued to execution 
pipes 222. In this embodiment, the GTAG values assigned 
by dispatch unit 214 are utilized to determine the relative age 
of instructions pending in the issue queues 220. Prior to 
issue, the destination register operand of the instruction is 
assigned to an available rename GPR. When an instruction 
is ultimately forwarded from issue queues 220 to the 
appropriate execution pipe, the execution pipe performs the 
appropriate operation as indicated by the instruction's opcode and 
writes the instruction's result to the instruction's rename 
GPR by the time the instruction reaches a finish stage 
(indicated by reference numeral 232) of the pipeline. A 
mapping is maintained between the rename GPRs and their 
corresponding architected registers. When all instructions in 
an instruction group (and all instructions in older instruction 
groups) finish without generating an exception, a 
completion pointer in the completion table 218 is 
incremented to the next instruction group. When the completion 
pointer is incremented to a new instruction group, the 
rename registers associated with the instructions in the old 
instruction group are released thereby committing the results 
of the instructions in the old instruction group. If one or 
more instructions older than a finished (but not yet 
committed) instruction generates an exception, the 
instruction generating the exception and all younger instructions 
are flushed and a rename recovery routine is invoked to 
return the GPR mapping to the last known valid state. 
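
The in-order commit and flush behavior described above can be sketched at the granularity of instruction groups. The class CompletionQueue and its method names are hypothetical, and the sketch tracks only group tags; rename register release and recovery are represented by the lists the methods return.

```python
from collections import deque


class CompletionQueue:
    """Tracks dispatched groups in program order (oldest on the left)."""

    def __init__(self):
        self.pending = deque()
        self.finished = set()

    def dispatch(self, gtag):
        self.pending.append(gtag)

    def finish(self, gtag):
        self.finished.add(gtag)            # groups may finish out of order

    def commit_ready(self):
        """Advance the completion pointer past every finished oldest group."""
        committed = []
        while self.pending and self.pending[0] in self.finished:
            gtag = self.pending.popleft()
            self.finished.discard(gtag)
            committed.append(gtag)         # rename registers would be released here
        return committed

    def flush_from(self, gtag):
        """On an exception in gtag, discard it and every younger group."""
        if gtag not in self.pending:
            return []
        flushed = []
        while self.pending:
            victim = self.pending.pop()    # youngest first
            self.finished.discard(victim)
            flushed.append(victim)
            if victim == gtag:
                break
        return flushed[::-1]
```

A group that finishes early is held until every older group has finished, so results are committed strictly in dispatch order even though execution is out of order.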

If a predicted branch is not taken (branch misprediction), 
the instructions pending in executions pipes 222 and issue 
queues 220 are flushed. In addition, the pointer 506 of the 
basic block cache entry 502 associated with the mispredicted 
branch is updated to reflect the most recent branch taken. An 
example of this updating process is illustrated in FIG. 5 for 
the case in which program execution results in a branch from 
leg 1 (instruction group 302a) to leg 4 (instruction group 
302o% Because the pointer 506 of entry 502a had previously 
predicted a branch to the instruction group residing in the 
number 2 entry of basic block cache 213 (i.e., group 3026), 
the actual branch from instruction group 302a to group 302tf 
was mispredicted. The mispredicted branch is detected and 
fed back to block fetch unit 215, the instructions pending 
between basic block cache 213 and the finish stage 232 of 
each of the pipelines 222 are flushed, and execution is 
re-started with instruction group 302a" in entry 4 of basic 
block cache 213. In addition, the pointer 506 of basic block 
cache entry 502a is altered from its previous value of 2 to its 
new value of 4 reflecting the most recent branch informa- 
tion. By incorporating basic block cache 213 and block fetch 
unit 215 in close proximity to the execution pipelines 222, 
the present invention contemplates a reduced performance 
penalty for a mispredicted branch. More specifically, by 
implementing basic block cache 213 on the "downstream" 
side of instruction cracking unit 212, the present invention 
eliminates instructions that are pending in cracking unit 212 
from the branch misprediction flush path thereby reducing 
the number of pipeline stages that must be purged following 
a branch mispredict and reducing the performance penalty.
In addition, the basic block cache 213 contemplates a
caching mechanism with a structure that matches the
organization of dispatch and completion control unit 216 and
completion table 218 thereby simplifying the organization of
the intervening logic and facilitating the implementation of
useful extensions to the basic block cache 213 as described 
below. 
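
The pointer update described above can be illustrated with a minimal sketch. It models only the next-entry pointer of each basic block cache entry, not the flush itself. The class name, the function name, and the pointer values of entries 2 through 4 are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass
class BasicBlockEntry:
    """Toy basic block cache entry: a cached group plus a next-entry pointer."""
    group: str     # label of the cached instruction group
    pointer: int   # predicted next entry number (models pointer 506)


def resolve_branch(cache, entry_no, actual_next):
    """Return True on a misprediction and record the most recent branch taken."""
    entry = cache[entry_no]
    mispredicted = entry.pointer != actual_next
    if mispredicted:
        # The pointer is overwritten so that the next prediction follows
        # the most recent branch outcome.
        entry.pointer = actual_next
    return mispredicted


# Entry 1 (leg 1) initially predicts entry 2; the other pointers are made up.
cache = {
    1: BasicBlockEntry("leg 1", 2),
    2: BasicBlockEntry("leg 2", 3),
    3: BasicBlockEntry("leg 3", 4),
    4: BasicBlockEntry("leg 4", 1),
}

# Execution actually branches from leg 1 to leg 4.
flush_needed = resolve_branch(cache, 1, 4)
```

After the call, `flush_needed` is true and the pointer of entry 1 holds 4, so a repeat of the same branch would be predicted correctly.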

The performance penalty caused by branch misprediction 
is minimized in processor 101 according to the present 
invention by the inclusion of a code sequence recognition 
unit (CSR). The CSR is preferably configured to detect code 
sequences that include a short branch sequence. A short 
branch sequence is characterized by a condition setting 
instruction followed by a conditional branch instruction and 
a short sequence of "substantive" instructions. The condition 
setting instruction is typically an instruction, such as a 
compare instruction, that alters the contents of the condition 
register in the PowerPC® architecture or an analogous 
register in another architecture. The substantive instructions 
are typically instructions, such as add instructions, that affect 
the contents of one or more general purpose or floating point 
registers. If the CSR detects the presence of a short branch 
sequence, it generates a functionally equivalent predicated 
code sequence by eliminating the branch instruction from 
the sequence and replacing each of the substantive instruc- 
tions with an analogous predicated instruction. 
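
The functional equivalence of the two forms can be illustrated with a minimal sketch, assuming a single substantive instruction (an increment of a register) and a branch that is taken when two operands are equal. The function names and the choice of condition are hypothetical.

```python
def run_branch_form(a, b, r5):
    """Original form: compare, branch around the add if equal, add."""
    equal = (a == b)        # condition setting instruction
    if not equal:           # conditional branch not taken: fall through
        r5 = r5 + 1         # substantive instruction
    return r5


def run_predicated_form(a, b, r5):
    """Converted form: compare, then an add predicated on 'not equal'."""
    equal = (a == b)        # condition setting instruction is retained
    result = r5 + 1         # the add is always issued; no branch remains
    if not equal:           # predicate decides whether the result is kept
        r5 = result
    return r5


# Both forms leave the register in the same state for every input tried.
cases = [(1, 2, 10), (2, 2, 10), (0, 0, 0), (5, 3, -1)]
results_match = all(
    run_branch_form(a, b, r) == run_predicated_form(a, b, r)
    for a, b, r in cases
)
```

The predicated form contains no control transfer, so there is nothing to mispredict; the cost is that the substantive instruction occupies an execution slot even when its result is discarded.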

In an embodiment of processor 101 that includes a 
cracking unit 212 as described previously, the CSR, as 
indicated in FIG. 4 by reference numeral 402, may be 
embedded in cracking unit 212. In this embodiment, CSR 
402 is configured to detect the presence of a conditional 
branch instruction in an instruction group 302. If a condi- 
tional branch is detected, the branch address of the condi- 
tional branch instruction is compared to the instruction 
address of the conditional branch instruction to determine if 
the sequence constitutes a short branch sequence. A short
branch sequence might be delineated as a branch sequence 
that includes no more substantive instructions than the 
number of instructions an instruction group 302 may con- 
tain. If, for example, instruction groups 302 produced by 
cracking unit 212 may include four substantive instructions, 
the upper limit on the number of substantive instructions in
a short branch sequence may be limited to four. 

In an architecture that employs fixed length instructions, 
the number of instructions that a conditional branch instruc- 
tion jumps around (when taken) may be calculated by 
dividing the offset between the instruction addresses of the 
conditional branch instruction and the branch target by the 
number of bytes per instruction. If the number of instruc- 
tions within the branch loop does not exceed the maximum 
number of instructions permitted for a short branch 
sequence, CSR 402 converts the code sequence to a func- 
tionally equivalent predicated code sequence 404 by delet- 
ing the conditional branch instruction and converting each 
substantive instruction to its predicated equivalent. 
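
The length test described above can be sketched as follows, assuming 4-byte fixed length instructions and a limit of four substantive instructions; both constants are assumptions for illustration. The sketch also subtracts one from the quotient so that the branch instruction itself is not counted among the instructions jumped over, which is an interpretation rather than something stated in the text.

```python
BYTES_PER_INSTRUCTION = 4     # assumption: fixed 4-byte instructions
MAX_SHORT_BRANCH_INSNS = 4    # assumption: four substantive slots per group


def instructions_jumped(branch_addr, target_addr,
                        bytes_per_insn=BYTES_PER_INSTRUCTION):
    """Count the instructions lying between the branch and its target.

    Returns None for backward branches or misaligned targets, which this
    sketch does not treat as candidates for conversion.
    """
    offset = target_addr - branch_addr
    if offset <= 0 or offset % bytes_per_insn != 0:
        return None
    # offset / bytes_per_insn is the distance in instructions; subtracting
    # one excludes the branch instruction itself (an interpretive choice).
    return offset // bytes_per_insn - 1


def is_short_branch(branch_addr, target_addr,
                    max_insns=MAX_SHORT_BRANCH_INSNS):
    """True if at least one and at most max_insns instructions are skipped."""
    count = instructions_jumped(branch_addr, target_addr)
    return count is not None and 1 <= count <= max_insns
```

For a branch at address 0x100 with target 0x114, the offset is 20 bytes, the distance is five instructions, and four instructions are jumped over, so the sequence qualifies under the assumed limit.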

In one embodiment, the code sequence conversion pro- 
cess is simplified by implementing predicated instructions 
with an opcode that is a function of its non-predicated 
equivalent. Each predicated instruction, for example, may be 
assigned an opcode that is a fixed offset from its correspond- 
ing non-predicated opcode. In another embodiment suitable 
for architectures in which it is not feasible to implement 
predicated instruction opcodes as a function of
non-predicated opcodes, CSR 402 may utilize an opcode
lookup table 403 that specifies the predicated opcode 
equivalent of each substantive instruction. During 
conversion, CSR 402 retrieves the appropriate predicated 
instruction opcode for each non-predicated substantive 
instruction in the original code sequence. The operands of
the predicated instruction remain the same as those of the
non-predicated instruction. The condition code on which
execution of the predicated instruction is based is determined
from the conditional branch instruction of the original code
sequence and included in each predicated instruction.
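
Both opcode mapping approaches can be sketched in a few lines. The mnemonics, the offset value, and the contents of the lookup table below are hypothetical and do not correspond to any real instruction set. The sketch also assumes that the predicate is the negation of the branch condition, which follows from the substantive instructions executing only when the branch is not taken.

```python
from typing import NamedTuple, Optional, Tuple


class Insn(NamedTuple):
    """Toy instruction: opcode mnemonic, operands, optional predicate."""
    op: str
    operands: Tuple[str, ...]
    predicate: Optional[str] = None


# Approach 1: predicated opcode is a fixed offset from the original opcode.
PREDICATED_OPCODE_OFFSET = 0x100   # hypothetical offset


def predicated_opcode_by_offset(opcode: int) -> int:
    return opcode + PREDICATED_OPCODE_OFFSET


# Approach 2: a lookup table (models opcode lookup table 403).
OPCODE_LOOKUP = {"add": "padd", "sub": "psub", "or": "por"}   # hypothetical

# The branch skips the substantive instructions when its condition holds,
# so each predicated instruction carries the opposite condition.
NEGATED_CONDITION = {"eq": "ne", "ne": "eq", "lt": "ge", "ge": "lt"}


def convert_short_branch(cond_set, branch, substantive):
    """Drop the branch and predicate each substantive instruction."""
    predicate = NEGATED_CONDITION[branch.operands[0]]
    converted = [cond_set]                      # condition setter is kept
    for insn in substantive:
        # Operands are unchanged; only the opcode and predicate differ.
        converted.append(Insn(OPCODE_LOOKUP[insn.op], insn.operands, predicate))
    return converted


original = [
    Insn("cmp", ("r3", "r4")),
    Insn("bc", ("eq", "skip")),          # branch to 'skip' if equal
    Insn("add", ("r5", "r5", "r6")),
    Insn("sub", ("r7", "r7", "r6")),
]
predicated = convert_short_branch(original[0], original[1], original[2:])
```

The converted sequence is one instruction shorter than the original because the conditional branch is deleted.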

In an embodiment in which cracking unit 212 groups 
instructions as discussed previously with branch instructions 
preferably allocated to the last slot in an instruction group 
302, CSR 402 may be configured to evaluate the contents of 
consecutive instruction groups in forming a predicated 
instruction sequence. In this embodiment, a code sequence 
is detected by examining the last slot in each instruction 
group 302. If the last slot contains a conditional branch 
instruction, CSR 402 determines if the branch target con- 
stitutes a short branch as discussed previously. If the maxi- 
mum length of a short branch is defined as the maximum 
number of substantive instructions contained in an instruc- 
tion group 302, CSR 402 needs only to convert the substan- 
tive instructions in the instruction group 302 immediately 
following the group 302 that contained the conditional 
branch.

To support predicated execution as contemplated
herein, each of the execution units, including FXU 228, FPU
230, and LSU 226, is configured to execute predicated
instructions. In one embodiment, the predication circuit might
include a preliminary execution pipeline stage in which the 
predicate condition is evaluated. If the predicate condition is 
true, the execution unit executes the instruction in the 
conventional manner. If the predicate condition is false, the 
instruction must be retired from the execution pipeline in a 
manner that a) leaves the target register in the same state as 
it was prior to executing the instruction and b) informs other 
instructions that are dependent on the result of the predicated 
instruction that they can proceed. These objectives could be 
met by discarding the result of the predicated instruction and 
broadcasting to all dependent instructions that the result of 
the predicated instruction is committed. In another 
embodiment, the predicated instruction could issue as two 
instructions only one of whose result is committed. One of 
the instructions would perform the same calculation (or 
other function) as the predicated instruction and the other 
instruction would perform no function. The instruction that 
is committed depends upon the predicate condition. In still 
another embodiment, the predicated instruction, upon
determining that the predicate condition is false, could perform a
read of the target register followed by an immediate write 
back to simulate a NOP while simultaneously exercising the 
renaming logic. 
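
The two objectives for a false predicate, leaving the target register unchanged and releasing dependent instructions, can be sketched with a toy register file. This models the read and write-back variant; the function name, the dictionary-based register file, and the set used to signal dependents are all hypothetical simplifications.

```python
def execute_predicated(regs, committed, target, compute, predicate_true):
    """Execute one predicated instruction against a toy register file.

    regs      -- dict mapping register name to value
    committed -- set of registers whose producers have reported completion
    compute   -- callable taking regs and returning the would-be result
    """
    result = compute(regs)             # calculation performed in either case
    if predicate_true:
        regs[target] = result          # conventional execution
    else:
        # Read the target and write the same value back: acts as a NOP
        # while still producing a write for the target register.
        regs[target] = regs[target]
    committed.add(target)              # dependents may proceed either way
    return regs[target]


regs = {"r5": 10, "r6": 3}
committed = set()

def add_r5_r6(r):
    return r["r5"] + r["r6"]

# Predicate false: r5 keeps its old value, yet dependents are released.
value_when_false = execute_predicated(regs, committed, "r5", add_r5_r6, False)
released_when_false = "r5" in committed

# Predicate true: the add takes effect.
value_when_true = execute_predicated(regs, committed, "r5", add_r5_r6, True)
```

In both cases the target register is marked as committed, which is what allows instructions waiting on it to issue without a pipeline flush.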

It will be apparent to those skilled in the art having the 
benefit of this disclosure that the present invention contem- 
plates improved performance by enabling hardware conver- 
sion of short branch code sequences to predicated execution 
code sequences and enabling the execution of the converted 
sequences. It is understood that the form of the invention 
shown and described in the detailed description and the 
drawings are to be taken merely as presently preferred 
examples. It is intended that the following claims be inter- 
preted broadly to embrace all the variations of the preferred 
embodiments disclosed. 

What is claimed is: 

1. A method of processing instructions in a 
microprocessor, comprising: 

receiving a sequence of instructions; 

detecting a short branch sequence within the sequence of 
instructions, wherein the short branch sequence 
includes a condition setting instruction, a conditional 
branch, and at least one additional instruction, wherein 
the additional instruction is conditionally executed if 
the conditional branch is not taken;

internally converting the short branch sequence to a
predicated instruction sequence including the condition
setting instruction and a predicated instruction
corresponding to each additional instruction in the short
branch sequence; and

executing the predicated instruction sequence in at least
one functional unit of the processor.

2. The method of claim 1 wherein detecting the short
branch sequence includes calculating the relative branch
address associated with the conditional branch instruction.

3. The method of claim 2, further comprising comparing
the relative branch address to a specified maximum.

4. The method of claim 3, further comprising, organizing
the received sequence of instructions into an instruction
group, and wherein the specified maximum relative branch
address is a function of the number of instructions in an
instruction group.

5. The method of claim 1, further comprising, forming an
instruction group from sequence of instructions, wherein
detecting the short branch sequence comprises detecting a
conditional branch instruction in the instruction group.

6. The method of claim 5, wherein the conditional branch
statement is located in a last slot of the instruction group.

7. The method of claim 6, wherein each additional
instruction in the short branch sequence is located in the
instruction group following the instruction group containing
the conditional branch statement.

8. The method of claim 1, wherein converting the short
branch sequence to the predicated instruction sequence
includes converting each additional instruction in the short
branch sequence to an analogous predicated instruction.

9. The method of claim 8, wherein converting each
additional instruction to its analogous predicated instruction
includes determining a predicated instruction opcode for
each additional instruction in the short branch sequence.

10. The method of claim 9 wherein determining the
predicated instruction opcode includes adjusting the opcode
of each additional instruction by a predetermined offset.

11. The method of claim 9, wherein determining the
predicated instruction opcode includes retrieving a
predicated instruction opcode from an opcode lookup table.

12. The method of claim 1, wherein each predicated
instruction includes a substantive instruction and a condition
that is set by the condition setting instruction, and wherein
the substantive instruction is executed if the condition is
true.

13. A microprocessor including:

a dispatch unit suitable for dispatching a sequence of
instructions;

a code sequence recognition unit (CSR) enabled to detect
a short branch sequence in the sequence of instructions,
wherein the short branch sequence includes a condition
setting instruction, a conditional branch, and at least
one additional instruction, and further enabled to
convert the short branch sequence to a functionally
equivalent predicated instruction sequence; and

at least one execution unit enabled to execute the
predicated instruction sequence.

14. The microprocessor of claim 13 wherein the CSR is
enabled to calculate the relative branch address associated
with the conditional branch instruction and to compare the
relative branch address to a specified maximum.

15. The microprocessor of claim 14, wherein the
microprocessor is enabled to organize the received sequence of
instructions into an instruction group, and wherein the
specified maximum relative branch address is a function of
the number of instructions in an instruction group.

16. The microprocessor of claim 13, wherein the CSR is
configured to convert each of the at least one additional
instruction in the short branch sequence to an analogous
predicated instruction.

17. The microprocessor of claim 16, wherein the CSR is
enabled to convert each of the at least one additional
instruction to its analogous predicated instruction by
determining a predicated instruction opcode for each additional
instruction in the short branch sequence.

18. The microprocessor of claim 17, wherein the CSR
determines the predicated instruction opcode includes
adjusting the opcode of each additional instruction by a
predetermined offset.

19. The microprocessor of claim 17, wherein determining
the predicated instruction opcode includes retrieving a
predicated instruction opcode from an opcode lookup table.

20. A data processing system including processor,
memory, input means, and display, wherein the processor
comprises:

a dispatch unit suitable for dispatching a sequence of
instructions;

a code sequence recognition unit (CSR) enabled to detect
a short branch sequence in the sequence of instructions,
wherein the short branch sequence includes a condition
setting instruction, a conditional branch, and at least
one additional instruction, and further enabled to
convert the short branch sequence to a functionally
equivalent predicated instruction sequence; and

at least one execution unit enabled to execute the
predicated instruction sequence.

21. The data processing system of claim 20, wherein the
CSR is enabled to calculate the relative branch address
associated with the conditional branch instruction and to
compare the relative branch address to a specified maximum.

22. The data processing system of claim 20, wherein the
CSR is enabled to convert each of the at least one additional
instructions to its analogous predicated instruction by
determining a predicated instruction opcode for each additional
instruction in the short branch sequence.

23. The data processing system of claim 22, wherein the
CSR determines the predicated instruction opcode includes
adjusting the opcode of each additional instruction by a
predetermined offset.

24. The microprocessor of claim 22, wherein determining
the predicated instruction opcode includes retrieving a
predicated instruction opcode from an opcode lookup table.

* * * * *